Supplement to: Estimation in discrete time coarsened multivariate longitudinal models

This document contains three appendices. Appendix A provides a derivation of the joint probability mass function (pmf) of the observed data and the ignorable likelihood function, together with the partial derivatives of the log-likelihood. Appendix B describes the importance sampling scheme and further details of the MCEM algorithm. Appendix C provides additional simulation details and results.

2 Appendix A: The log-likelihood function and its partial derivatives

2.1 Notation

A list of key notation introduced in this paper is provided in Supplementary Table 1 below. Recall in particular that D equals the first time of absorption or censoring, ∆ is an indicator of right censoring, and C_t := I(D − ∆ ≥ t).
Let 𝒴 be the sample space of Y that only includes realizations with nonzero probability, and let 𝒱 and 𝒵 be the corresponding sample spaces for V and Z. The observed data Y^obs, Z^obs, and V^obs are represented as functions of the latent data Y, Z, and V, using notation similar to that used by [1] for coarsened data, but adapted to the present situation where the coarsening is not monotone. For realizations v, y, z of the latent data we obtain the observed data as functions of the latent data,

G^1_{v^obs}(y) = y^obs, the observed part of y,
G^2_{c}(z) = z^obs, the observed part of z,
G^3_{c}(v) = v^obs, the observed part of v,   (1)

where c, d, and δ are omitted from the notation since these follow from y^obs and v^obs. Let [t] appended to a process indicate all data within the interval 1 to t, e.g. Y_[t] = (Y_1, . . . , Y_t); realizations and sample spaces are defined analogously, and the functions in (1) are defined analogously when applied to data between times 1 and t.

Note that C, D and ∆ can be computed from Y^obs and V^obs and are therefore also observed. To see this, recall that T_2(V^obs) returns the first time t at which all components of V^obs_t are zero, which is equal to the first time at which C_t is zero, if this occurs. If not, then T_2(V^obs) equals τ + 1 and C_t = 1 for all t. Thus C is known, and therefore so is D − ∆. Also note that an absorption occurred at the last time t where C_t = 1 only if ∑_{j=n_na+1}^{n} Y^obs_{j,t} > 0. This determines ∆ and D.
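To make the recovery of C, D and ∆ concrete, the following is a minimal Python sketch under assumed array conventions; the function name, the array layout, and the split between non-absorbing and absorbing rows are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def recover_censoring(y_obs, v_obs, n_na):
    """Recover (C_t), D and Delta from observed data (assumes d - delta >= 1).

    y_obs : (n + 1, tau) 0/1 array; rows 0..n_na are non-absorbing events
            (row 0 is the reference event) and rows n_na+1..n are absorbing.
    v_obs : (n_na, tau) array of coarsening indicators; a column of zeros
            marks the first time the subject is no longer at risk.
    """
    tau = y_obs.shape[1]
    # T2(v_obs): first time all components of v_obs_t are zero, else tau + 1.
    all_zero = np.all(v_obs == 0, axis=0)
    t2 = int(np.argmax(all_zero)) + 1 if all_zero.any() else tau + 1
    # C_t = 1 while the subject is still at risk, i.e. for t < T2(v_obs).
    c = (np.arange(1, tau + 1) < t2).astype(int)
    d_minus_delta = int(c.sum())              # last time with C_t = 1
    # Absorption at that time iff some absorbing event is observed there.
    absorbed = y_obs[n_na + 1:, d_minus_delta - 1].sum() > 0
    delta = 0 if absorbed else 1              # Delta indicates right censoring
    d = d_minus_delta + delta
    return c, d, delta
```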

The latent likelihood contribution
Since Z is external, the joint pmf of (Y, Z), denoted Pr_{θ,ψ}(y, z), is proportional, with respect to θ, to a product of the conditional pmfs of Y_t,

Pr_{θ,ψ}(Y = y, Z = z) ∝ ∏_{t=1}^{τ} Pr_θ(Y_t = y_t | H_{t−1} = h_{t−1}, Z_t = z_t),

where ∝ denotes that the right-hand side is proportional to the left-hand side with respect to θ. The right-hand side of the above display is equal to the latent likelihood contribution

L_latent(θ; y, z, d, δ) := ∏_{t=1}^{τ} Pr_θ(Y_t = y_t | H_{t−1} = h_{t−1}, Z_t = z_t).

Note that the product above only needs to range from t = 1 until min(τ, T_1(y)), for the following reason. If event j cannot occur at time t given H_{t−1} = h_{t−1}, then b_{j,t} = 0. In this case we always observe y_{j,t} = 0, so that {α_θ(j, t | h_{t−1}, z_t) b_{j,t}}^{y_{j,t}} = 1. Also, y_{j,t} = 0 for all j when t > T_1(y), so all factors at times after T_1(y) equal 1.
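For illustration, the latent log-likelihood contribution can be evaluated as in the following sketch, assuming the discrete hazards α_θ(j, t | h_{t−1}, z_t) have already been arranged in an array; all names and array layouts are illustrative assumptions.

```python
import numpy as np

def latent_loglik(y, hazards, b, t1, tau):
    """Latent log-likelihood contribution for one subject.

    y       : (n + 1, tau) 0/1 array of latent event indicators.
    hazards : (n + 1, tau) array with alpha_theta(j, t | h_{t-1}, z_t).
    b       : (n + 1, tau) 0/1 array; b[j, t-1] = 1 if event j can occur at t.
    t1      : T_1(y), the first time of absorption (tau + 1 if none).
    """
    loglik = 0.0
    # The product only needs to run until min(tau, T_1(y)); later factors are 1.
    for t in range(1, min(tau, t1) + 1):
        for j in range(y.shape[0]):
            if y[j, t - 1] == 1:              # factors with y_{j,t} = 0 equal 1
                loglik += np.log(hazards[j, t - 1] * b[j, t - 1])
    return loglik
```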

The likelihood function of the observed data
We derive the observed data likelihood function induced by the latent data pmf. The joint pmf of the observed data is

Pr_{θ,ψ}(Y^obs = y^obs, V^obs = v^obs, Z^obs = z^obs).   (4)

Using the latent data joint probability mass function, (4) can be written as a sum over all realizations of the latent variables compatible with the observed data, that is, over v ∈ 𝒱 with G^3_c(v) = v^obs, y ∈ 𝒴 with G^1_{v^obs}(y) = y^obs, and z ∈ 𝒵 with G^2_c(z) = z^obs. Therefore, (4) is proportional to

∑_{y ∈ 𝒴: G^1_{v^obs}(y) = y^obs} ∑_{z ∈ 𝒵: G^2_c(z) = z^obs} Pr_{θ,ψ}(Y = y, Z = z).   (7)

We now simplify (7) by summing out the components of y and z after time d − δ. This is possible since the compatibility conditions only restrict y and z up to time d − δ. Because of this, the sum of all conditional probabilities of the components of y and z after d − δ, given y_[d−δ] and z_[d−δ], must equal 1. Therefore, (7) is equal to

∑_{y_[d−δ]: G^1_{v^obs_[d−δ]}(y_[d−δ]) = y^obs_[d−δ]} ∑_{z_[d−δ]: G^2_{c_[d−δ]}(z_[d−δ]) = z^obs_[d−δ]} Pr_{θ,ψ}(Y_[d−δ] = y_[d−δ], Z_[d−δ] = z_[d−δ]).

Repeating the arguments that lead to Pr_{θ,ψ}(Y = y, Z = z) ∝ ∏_{t=1}^{τ} Pr_θ(Y_t = y_t | H_{t−1} = h_{t−1}, Z_t = z_t) on the terms in the above display, the joint probability mass of the observed data (4) is seen to be proportional to what we define as the ignorable likelihood contribution, ignoring the coarsening,

L_ign(θ; y^obs, v^obs, z^obs, d, δ) := ∑_{y_[d−δ]} ∑_{z_[d−δ]} ∏_{t=1}^{d−δ} Pr_θ(Y_t = y_t | H_{t−1} = h_{t−1}, Z_t = z_t),   (9)

where the sums range over the realizations compatible with the observed data as in the previous display. For N independent subjects indexed by k, the ignorable likelihood function is proportional to

L_ign(θ) := ∏_{k=1}^{N} L_ign(θ; y^obs_k, v^obs_k, z^obs_k, d_k, δ_k).   (10)
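For intuition only, the ignorable likelihood contribution can be evaluated by brute force when the follow-up is short, by enumerating all event sequences compatible with the observed data and summing the corresponding products of conditional probabilities. The sketch below assumes, for simplicity, that z is fully observed and that a user-supplied hazard function is available; these are illustrative assumptions.

```python
from itertools import product
import numpy as np

def ignorable_lik_bruteforce(y_obs, hazard_fn, d_minus_delta):
    """Sum the latent likelihood over all event sequences compatible with y_obs.

    y_obs     : (n + 1, tau) 0/1 array; several ones in a column mean that the
                event at that time is coarsened to a set of candidates.
    hazard_fn : callable (j, t, history) -> discrete hazard of event j at time t
                given the events sampled before time t (and covariates).
    """
    candidates = [np.flatnonzero(y_obs[:, t - 1]) for t in range(1, d_minus_delta + 1)]
    total = 0.0
    for path in product(*candidates):                 # one candidate event per time
        lik = 1.0
        for t, j in enumerate(path, start=1):
            lik *= hazard_fn(j, t, path[: t - 1])     # hazards depend on the past
        total += lik
    return total
```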

Partial derivatives of the ignorable log-likelihood
In the following, we compute the first- and mixed second-order partial derivatives of (10) with respect to θ.
First, we simplify the notation. For a fixed realization of the observed data up to time d − δ, we suppress the dependence on the data in the notation and write ∑_y L_latent(θ; y) instead of the more cumbersome sum on the right-hand side of (9). Clearly, ∑_y L_latent(θ; y) > 0. The log-likelihood function involves logarithms of sums over finite sets of the form log{∑_y L_latent(θ; y)}.
When there is no coarsening, the sum above consists of only one term and becomes log L_latent(θ), which is the fully observed log-likelihood function.

Partial derivatives with only right censoring
We first compute the derivatives of the log-likelihood contribution in (10) in the absence of coarsening. For each time t, y_t is a column vector and P_t is a row vector. Define the local likelihood contribution as

L_t^latent(θ) := ∏_{j=0}^{n} {P_{j,t} b_{j,t}}^{y_{j,t}}.

Note that exactly one factor is not equal to 1, since exactly one of the powers equals 1 and the others are zero (say y_{f,t} = 1 for an index f and y_{j,t} = 0 for all j ≠ f). Also, whenever y_{j,t} = 1 then b_{j,t} = 1, so y_t = y_t · b_t, and we may write

L_t^latent(θ) = ∏_{j=0}^{n} {P_{j,t}^{b_{j,t}}}^{y_{j,t}}.

Its logarithm is

ℓ_t(θ) := log(P_t^{b_t}) y_t,

where the exponent and logarithm functions act element-wise. The log-likelihood contribution is therefore

ℓ(θ) := ∑_{t=1}^{d−δ} ℓ_t(θ) = ∑_{t=1}^{d−δ} log(P_t^{b_t}) y_t.

Expressions for the partial derivatives, ∂/∂θ_j ℓ_t(θ), and the entries of the Hessian, ∂²/(∂θ_l ∂θ_j) ℓ_t(θ), are provided below.
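In vectorized form, the per-time contribution ℓ_t(θ) = log(P_t^{b_t}) y_t can be computed as in this small sketch (names are illustrative; P_t, b_t and y_t are assumed to be given as numpy vectors).

```python
import numpy as np

def ell_t(P_t, b_t, y_t):
    """Per-time log-likelihood contribution log(P_t^{b_t}) y_t.

    P_t, b_t, y_t : length n + 1 vectors of hazards, at-risk indicators
                    and event indicators at time t.
    """
    # b_t * log(P_t) is the element-wise log of P_t^{b_t}; where b_{j,t} = 0
    # the factor equals 1 and contributes log 1 = 0 (and then y_{j,t} = 0 too).
    with np.errstate(divide="ignore", invalid="ignore"):
        log_pb = np.where(b_t == 1, np.log(P_t), 0.0)
    return float(log_pb @ y_t)
```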

Partial derivatives with general coarsening
With coarsening, the partial derivatives of (11) are

∂/∂θ_j log{∑_y L_latent(θ; y)} = {∑_y ∂/∂θ_j L_latent(θ; y)} / {∑_y L_latent(θ; y)}.

The mixed second-order partial derivatives are

∂²/(∂θ_l ∂θ_j) log{∑_y L_latent(θ; y)} = {∑_y ∂²/(∂θ_l ∂θ_j) L_latent(θ; y)} / {∑_y L_latent(θ; y)} − {∑_y ∂/∂θ_l L_latent(θ; y)}{∑_y ∂/∂θ_j L_latent(θ; y)} / {∑_y L_latent(θ; y)}².

By the product rule, these derivatives can be expressed in terms of the time local latent likelihood contributions L_t^latent(θ). Expressions for the partial derivatives, ∂/∂θ_j L_t^latent(θ), and the entries of the Hessian, ∂/∂θ_l ∂/∂θ_j L_t^latent(θ), are provided below.
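A direct numerical sketch of these ratio formulas, assuming the per-realization likelihood values and their derivatives are available as arrays (all names are illustrative):

```python
import numpy as np

def grad_hess_log_sum(L, dL, d2L):
    """Gradient and Hessian of log(sum_y L_latent(theta; y)).

    L   : (m,) likelihood values, one per compatible realization y.
    dL  : (m, p) first derivatives dL/dtheta_j.
    d2L : (m, p, p) mixed second derivatives.
    """
    s = L.sum()
    grad = dL.sum(axis=0) / s
    hess = d2L.sum(axis=0) / s - np.outer(dL.sum(axis=0), dL.sum(axis=0)) / s**2
    return grad, hess
```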

Partial derivatives of related functions
In the following we compute all partial derivatives required to compute the gradients and Hessians of the log-likelihoods described above. We first introduce some additional definitions and notation. Define n_nonabs := n_na + 1 to be the number of non-absorbing events including event 0, and recall that n = n_na + n_a is the number of types of events excluding the reference event. For a time t, let X_t (1 × k) be a design matrix including an intercept and all covariates present in any of the linear predictors. It may contain arbitrary summaries of the history of the process up until time t and may be used to define the current state of the subject, if that is meaningful. Recall that θ is a (p × 1) column vector. Define the linear parameter map ϕ := ϕ(θ) (n × k) that maps the components of θ to where they are active in the n linear predictors η_j in the (1 × n) vector η_t = X_t ϕ^T.
Note that we allow linear predictors to share parameters, if desired.
In the following, let b_t be a realization of B_t^T, so that, in contrast with the above, b_t now has dimension 1 × (n + 1). For a vector a, let a^− be the vector equal to a except that the first element has been omitted. Since the likelihood contribution is only evaluated at times when a subject is at risk, we always have b_{0,t} = 1 for t ≤ d − δ.
Let the denominator of the discrete hazard defined by the baseline category model be

Ref_t := 1 + ∑_{j=1}^{n} exp(η_{j,t}).

Recall that for a given time t and data h_{t−1}, z_t, the j-th discrete hazard function P_{j,t} is identified with

P_{j,t} = exp(η_{j,t}) / Ref_t for j = 1, . . . , n, and P_{0,t} = 1 / Ref_t,

and the 1 × (n + 1) vector of probabilities P_t is

P_t = (P_{0,t}, P_{1,t}, . . . , P_{n,t}) = (1, exp(η_t)) / Ref_t.
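As a concrete sketch of this parameterization, the following function maps a covariate vector X_t and the linear parameter map ϕ(θ) to the hazard vector P_t; the function and variable names are illustrative assumptions.

```python
import numpy as np

def hazards_t(X_t, phi):
    """Baseline-category discrete hazards P_t at one time point.

    X_t : (k,) design vector (intercept, covariates, possibly history summaries).
    phi : (n, k) linear parameter map phi(theta); row j holds the coefficients
          of the linear predictor eta_{j,t} for event j = 1, ..., n.
    """
    eta_t = phi @ X_t                                       # (n,) linear predictors
    ref_t = 1.0 + np.exp(eta_t).sum()                       # denominator Ref_t
    P_t = np.concatenate(([1.0], np.exp(eta_t))) / ref_t    # (n + 1,), P_{0,t} first
    return P_t
```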

The log-likelihood without coarsening
The partial derivatives of the log-likelihood ∂/∂θ_j ℓ_t(θ) are given by

∂/∂θ_j ℓ_t(θ) = {∂/∂θ_j log(P_t^{b_t})} y_t,

and the Hessian is the p × p matrix with entries of the form ∂²/(∂θ_l ∂θ_j) ℓ_t(θ), with

∂²/(∂θ_l ∂θ_j) ℓ_t(θ) = {∂/∂θ_l ∂/∂θ_j log(P_t^{b_t})} y_t.

We now derive expressions for ∂/∂θ_j log(P_t^{b_t}) y_t and ∂/∂θ_l ∂/∂θ_j log(P_t^{b_t}) y_t. Brackets after expressions indicate the dimension of the corresponding matrix.
The partial derivative of the linear map ϕ, ∂ϕ/∂θ_j (n × k), is an indicator matrix with elements equal to 1 in the cells where the parameter θ_j enters ϕ and 0 elsewhere. Hence, the partial derivative of the linear predictors is

∂/∂θ_j η_t = X_t (∂ϕ/∂θ_j)^T (1 × n).

Note that ∂/∂θ_j η_t is a constant, so differentiating it again with respect to θ_l gives the zero vector. Moreover, the partial derivative of Ref_t is the scalar

∂/∂θ_j Ref_t = ∑_{i=1}^{n} exp(η_{i,t}) (∂/∂θ_j η_{i,t}).

To compute ∂/∂θ_j log(P_t^{b_t}) we first compute the partial derivative of P_t, which is given in (29) in terms of the quantities A_t(j). Differentiation of each component log(P_{i,t}^{b_{i,t}}) in log(P_t^{b_t}) results in 0 when b_{i,t} = 0; when b_{i,t} = 1 then P_{i,t} > 0 and the derivative follows from the corresponding component of (29). The mixed second-order partial derivative is obtained analogously, where the partial derivative of A_t(j) has the same dimension as A_t(j) and is expressed in terms of the partial derivative of P_t^−.
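The chain of derivatives through ϕ, η_t, Ref_t and P_t can be sketched numerically as below. The sketch computes the score and Hessian of ℓ_t(θ) using the standard baseline-category (multinomial logit) derivatives; it ignores the at-risk restriction b_t for simplicity, and all names are illustrative assumptions.

```python
import numpy as np

def score_hessian_t(X_t, phi_indicators, theta, y_t):
    """Score and Hessian of the time-t log-likelihood contribution.

    X_t            : (k,) design vector at time t.
    phi_indicators : (p, n, k) array; phi_indicators[m] is the indicator matrix
                     d phi / d theta_m (1 where theta_m enters the linear map).
    theta          : (p,) parameter vector.
    y_t            : (n + 1,) event indicator vector (reference event first).
    """
    phi = np.tensordot(theta, phi_indicators, axes=1)   # (n, k) linear map phi(theta)
    eta = phi @ X_t                                      # (n,) linear predictors
    ref = 1.0 + np.exp(eta).sum()                        # Ref_t
    P = np.concatenate(([1.0], np.exp(eta))) / ref       # (n + 1,) hazards P_t
    J = phi_indicators @ X_t                             # (p, n) Jacobian d eta / d theta
    resid = y_t[1:] - P[1:]                              # y_{j,t} - P_{j,t}, j >= 1
    score = J @ resid                                    # (p,) gradient of ell_t
    W = np.diag(P[1:]) - np.outer(P[1:], P[1:])          # (n, n) multinomial covariance
    hessian = -J @ W @ J.T                               # (p, p)
    return score, hessian, P
```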

The latent likelihood under coarsening
Recall that, by the product rule, the partial derivative of the latent likelihood is

∂/∂θ_j L_latent(θ; y) = ∑_{t} {∂/∂θ_j L_t^latent(θ)} ∏_{s ≠ t} L_s^latent(θ).

The partial derivative of the local likelihood contribution, ∂/∂θ_j L_t^latent(θ), follows from ∂/∂θ_j P_t, which is given by (29). Dividing by L_latent(θ; y) expresses the result in terms of the ratios {∂/∂θ_j L_t^latent(θ)} / L_t^latent(θ). The only remaining unknown terms in (18) are then the partial derivatives, which are given by (19) and (34), respectively.

Shrinking support
We introduce some restrictions on the support of Y which allow us to relatively easily construct a proposal pmf that incorporates information from the past and the future, and that is guaranteed not to propose realizations outside the support of the target pmf.
The support of Y_t given the past is assumed to shrink with time in the following sense. For each event j there is an index set F_j ⊂ {0, . . . , n} such that the events in F_j are the only events that can occur after event j.
We assume that event 0 does not prevent any other event from occurring at a later time, so F_0 = {0, . . . , n}, and that the reference event 0 can always occur with positive probability given any past events prior to absorption, so 0 ∈ F_j for j = 0, . . . , n_na. Since nothing occurs after absorption, F_j = ∅ for j = n_na + 1, . . . , n. Recall also that the coarsening only mixes non-absorbing events with event 0. This implies that V_{j,t} = 0 for some j ≠ 0 whenever event j is coarsened, and otherwise V_{j,t} = j.
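As a small illustration of the shrinking-support assumption, with hypothetical event labels that are not taken from the paper:

```python
# Hypothetical support sets F_j for n = 3 event types plus the reference event 0.
# Event 1 may be followed by anything, event 2 rules out a later event 1,
# and nothing can follow the absorbing event 3.
F = {
    0: {0, 1, 2, 3},   # the reference event never restricts the future
    1: {0, 1, 2, 3},
    2: {0, 2, 3},      # event 1 can no longer occur after event 2
    3: set(),          # absorbing event
}
```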

Generating compatible realizations
We now define explicitly what a compatible realization means. Note that a realization with events e_1, . . . , e_{d−δ} satisfies y ∈ 𝒴 and is compatible with the observed data if and only if, for every t = 1, . . . , d − δ,

∏_{1≤s<t} I(e_t ∈ F_{e_s}) · I(e_t ∈ {j : y^obs_{j,t} = 1}) · ∏_{t<s≤d−δ} I(e_s ∈ F_{e_t}) = 1.   (42)

This allows us to generate events sequentially at times during which it is unknown exactly which event occurred, for the following reason. Say that we have generated a compatible sequence of events up until time t − 1, or that all events are observed up until t − 1. Then ∏_{1≤s<t} I(e_t ∈ F_{e_s}) I(e_t ∈ {j : y^obs_{j,t} = 1}) = 1 as long as e_t ∈ {j : y^obs_{j,t} = 1}. We cannot, however, always evaluate I(e_s ∈ F_{e_t}) for all s > t, since these events have not yet been sampled at times where it is unknown which event occurred. This is not an issue as long as we can pick e_t such that there is at least one sequence of future events that can occur given the past and the observed future. When e_s is observed we check whether e_s ∈ F_{e_t}; at times s when e_s is unknown, i.e. when ∑_{j=0}^{n} y^obs_{j,s} > 1, there is always a chance that event 0 occurred, since y^obs_{0,s} = 1, since 0 ∈ F_j for j = 0, . . . , n_na and F_0 = {0, . . . , n}, and since an absorption has not yet occurred (∑_{j=n_na+1}^{n} y^obs_{j,s} = 0). We are therefore certain that there is always some future sequence of events with e_s ∈ F_{e_t} that is also compatible with the model and the observed data, and it is enough to check e_s ∈ F_{e_t} at times where e_s is observed.
The proposal is constructed as follows. Consider observed events e^obs_t that are set equal to * at times where it is unknown which event occurred (∑_{j=0}^{n} y^obs_{j,t} ≥ 2). Let J be an indicator matrix of the same dimension as y^obs with components J_{j,t} = 1 if event j can occur at time t given y^obs_s and v^obs_s for s ≥ t, and 0 otherwise, corresponding to the last two factors in (42). That is, for each 1 ≤ t < d − δ and j = 0, . . . , n, set

J_{j,t} := y^obs_{j,t} ∏_{t<s≤d−δ: e^obs_s ≠ *} I(e^obs_s ∈ F_j),

and at t = d − δ set J_t = y^obs_t, since there is no future information available that can further restrict the set of possible events at time t.
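A sketch of the construction of J from y^obs and the support sets F_j is given below; the array conventions and names are illustrative assumptions.

```python
import numpy as np

def compatibility_matrix(y_obs, F, d_minus_delta):
    """Indicator matrix J: J[j, t-1] = 1 if event j is compatible with the
    observation at time t and with all *observed* future events.

    y_obs : (n + 1, tau) 0/1 array of observed (possibly coarsened) events.
    F     : dict mapping event j to the set of events that may occur after j.
    """
    n_plus_1 = y_obs.shape[0]
    J = np.zeros_like(y_obs)
    for t in range(1, d_minus_delta + 1):
        for j in range(n_plus_1):
            if y_obs[j, t - 1] == 0:
                continue                          # incompatible with the observation at t
            ok = True
            for s in range(t + 1, d_minus_delta + 1):
                observed = np.flatnonzero(y_obs[:, s - 1])
                if len(observed) == 1:            # e_s is observed (not coarsened)
                    ok = ok and (observed[0] in F[j])
            J[j, t - 1] = int(ok)                 # at t = d - delta, J_t = y_obs_t
    return J
```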

Importance sampling
Sampling is performed sequentially over t = 1, . . . , d − δ, and at each t a sample y_t is generated. The conditional proposal pmf at time t given the past is set to

Q_{j,t} := J_{j,t} P_{j,t} / ∑_{j=0}^{n} J_{j,t} P_{j,t}.

Note that the support of Q_t is the set of events j such that b_{j,t} J_{j,t} = 1. Clearly, this set is always nonempty: event 0 is always in the support if the event is unknown at t (e^obs_t = *) since J_{0,t} = 1, and if the event is known then it is in the support since it must be compatible with the current observation and future observed events (J_{e^obs_t, t} = 1), and with the past events. The proposal pmf of the entire event history is the product

∏_{t=1}^{d−δ} Q_t y_t^T,

where A^T indicates the transpose of the matrix A, and the likelihood of the sample is proportional to

∏_{t=1}^{d−δ} P_t y_t^T.

Note that it is possible that Q_t places all probability mass on one event j, e.g. if the event at time t is known (y^obs_{j,t} = 1 is the only nonzero component), in which case the sampled realization satisfies y_{j,t} = 1 and y_{k,t} = 0 for k ≠ j. The sampling algorithm is summarized in Algorithm (1) in the main article, and is repeated m times before the normalized weights w̃ are computed for each individual.
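The sequential proposal and the resulting importance weight can be sketched as follows; hazard_fn and the array conventions are illustrative assumptions, and the weight is accumulated as the ratio of target to proposal probabilities of the sampled path.

```python
import numpy as np

def sample_path(y_obs, J, hazard_fn, d_minus_delta, rng):
    """Draw one compatible latent path and its (unnormalized) log importance weight.

    J         : compatibility matrix as constructed above.
    hazard_fn : callable (t, history) -> length n + 1 vector of hazards P_t,
                given the events sampled before time t.
    rng       : numpy random generator, e.g. np.random.default_rng().
    """
    n_plus_1 = y_obs.shape[0]
    y = np.zeros_like(y_obs)
    history, log_weight = [], 0.0
    for t in range(1, d_minus_delta + 1):
        P_t = hazard_fn(t, history)                      # target hazards given the past
        q = J[:, t - 1] * P_t
        Q_t = q / q.sum()                                # proposal Q_{j,t} prop. to J_{j,t} P_{j,t}
        j = rng.choice(n_plus_1, p=Q_t)                  # sample the event at time t
        y[j, t - 1] = 1
        history.append(j)
        log_weight += np.log(P_t[j]) - np.log(Q_t[j])    # target / proposal ratio
    return y, log_weight
```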

3.3.1 The score and Hessian of Q(θ, θ_r), and the observed information matrix

Recall that Y^obs is all the observed data for all subjects, and Y is the latent data for all subjects. Let the gradient and Hessian of the latent log-likelihood be

S(y) := ∂/∂θ log L_latent(θ; y)  and  H(y) := ∂²/(∂θ ∂θ^T) log L_latent(θ; y).

Formulas for the gradient and Hessian of Q(θ, θ_r) are given by [2]. The gradient is

∂/∂θ Q(θ, θ_r) = E_{θ_r}{S(Y) | Y^obs},   (45)

and the Hessian is

∂²/(∂θ ∂θ^T) Q(θ, θ_r) = E_{θ_r}{H(Y) | Y^obs}.   (46)

Following [3] or [4], the observed information matrix used to compute standard errors is obtained at the last iteration and is equal to the negative of

E{H(Y) | Y^obs} + E{S(Y) S(Y)^T | Y^obs} − E{S(Y) | Y^obs} E{S(Y) | Y^obs}^T.   (47)

The score and Hessian of Q and the observed information matrix can all be approximated given Monte Carlo samples and importance weights, as discussed by [3]. Recall that k indexes subjects and j indexes Monte Carlo samples, and in the following the indices i, l indicate components of the score and Hessian, e.g. S_i(y) and H_{i,l}(y). Let ỹ_{k,j} be sample j for subject k of the latent data generated by Algorithm (1), and let w̃_{k,j} be the corresponding self-normalized weight, normalized across j for fixed k.
It follows that ∑_{k,j} w̃_{k,j} log L_latent(θ; ỹ_{k,j}, z_k, d_k, δ_k) is a Monte Carlo approximation of Q(θ | θ_r) when the Monte Carlo sampling has been performed with θ_r as parameter vector. Similarly, the expectation of the score in (45) can be approximated by summing over subjects the per-subject approximations

E{S_i(Y_k) | Y^obs} ≈ ∑_{j} w̃_{k,j} S_i(ỹ_{k,j}),   (48)

and the expectation in (46) and the first term in (47) can be approximated analogously by

∑_{k} ∑_{j} w̃_{k,j} H_{i,l}(ỹ_{k,j}).

The second term in (47) is equal to

E{S_i(Y) S_l(Y) | Y^obs} = ∑_{k} E{S_i(Y_k) S_l(Y_k) | Y^obs} + ∑_{k ≠ k'} E{S_i(Y_k) | Y^obs} E{S_l(Y_{k'}) | Y^obs},   (50)

and the expectations of the products of score components at the end of (50) can be approximated by

∑_{j} w̃_{k,j} S_i(ỹ_{k,j}) S_l(ỹ_{k,j}),

while the second term at the end of (50) is a double sum of expectations, each approximated as in (48). The third term in (47) is approximately equal to zero at convergence.
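A compact numerical sketch of these weighted Monte Carlo approximations, assuming the per-sample scores and Hessians have already been computed and stored as arrays (all names are illustrative):

```python
import numpy as np

def mcem_information(scores, hessians, weights):
    """Approximate the E-step quantities and the observed information matrix.

    scores   : (K, m, p) array, S(y~_{k,j}) for subject k and Monte Carlo sample j.
    hessians : (K, m, p, p) array, H(y~_{k,j}).
    weights  : (K, m) self-normalized importance weights (rows sum to 1).
    """
    # Per-subject conditional expectations E{S(Y_k) | Y_obs} and E{H(Y_k) | Y_obs}.
    Es = np.einsum("kj,kjp->kp", weights, scores)
    Eh = np.einsum("kj,kjpq->kpq", weights, hessians)
    Ess = np.einsum("kj,kjp,kjq->kpq", weights, scores, scores)
    grad_Q = Es.sum(axis=0)                        # approximates (45)
    hess_Q = Eh.sum(axis=0)                        # approximates (46)
    # Second term of (47): within-subject E{S S^T} plus cross-subject products.
    total_Es = Es.sum(axis=0)
    cross = np.outer(total_Es, total_Es) - np.einsum("kp,kq->pq", Es, Es)
    second = Ess.sum(axis=0) + cross
    # Observed information = -(first + second - third); the third term is ~0 at convergence.
    obs_info = -(hess_Q + second - np.outer(total_Es, total_Es))
    return grad_Q, hess_Q, obs_info
```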

Parameter specification and convergence criteria
Implementing the MCEM algorithm involves making choices such as the number of Monte Carlo samples at each iteration, parameter updating rules (e.g. step size), convergence criteria, and burn-in. There are many combinations of choices and alternative approaches that can be considered in future work to improve the performance of the method, compare with e.g. [5] and [6]. The following choices were made to keep the implementation simple while making the simulation studies in the main article feasible, and parameters were chosen based on a preliminary set of simulations.
The parameters were updated according to Algorithm (2) in the main article; see Table 5 for a summary of the simulation settings. No burn-in or convergence criterion was used; instead, the algorithm was allowed to run until the maximal number of iterations (10) was reached. Q was maximized iteratively based on the Monte Carlo estimates of the expected gradient and Hessian of the latent log-likelihood.
4 Appendix C: additional simulation details and results

Data generating model parameters
The data generating model parameters are provided in the Supplementary Tables. The treatment states are: a baseline state in which the subject has received neither GnRH nor AA, S1 = received AA but not GnRH, and S2 = received GnRH. "S2 had S1" is an indicator variable equal to 1 only if the subject is in state S2 and has previously been in state S1, else 0.